CNV-80608: add CRD support and create alert rule API #841
sradco wants to merge 1 commit into openshift:main from alerts-management-api
Conversation
Walkthrough
This pull request introduces a comprehensive alert-rule management API for Kubernetes/OpenShift monitoring. The changes add support for creating user-defined and platform alert rules through HTTP endpoints, with Kubernetes resource management, conflict detection, external-management enforcement, and extensive validation. Build configuration updates enable compilation against the new internal packages.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
@sradco: This pull request references CNV-80608, which is a valid Jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: sradco. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Hi @jgbernalp, @jan--f, @simonpasquier, I restructured the code so this PR includes the first API,
Force-pushed from 9688dd6 to 2767365 (Compare)
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 2767365 to a717988 (Compare)
Add AlertingRule, AlertRelabelConfig, and RelabeledRules CRD interfaces with the management client, router, server wiring, and POST /api/v1/alerting/rules endpoint. Signed-off-by: Shirly Radco <sradco@redhat.com> Signed-off-by: João Vilaça <jvilaca@redhat.com> Signed-off-by: Aviv Litman <alitman@redhat.com> Co-authored-by: AI Assistant <noreply@cursor.com>
Force-pushed from a717988 to 80147f1 (Compare)
@sradco: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 8
🧹 Nitpick comments (8)
pkg/k8s/alert_relabel_config.go (1)
95-102: Inconsistent error message in `Delete`: missing namespace. The `Create` and `Update` methods include both namespace and name in their error messages (`%s/%s`), but `Delete` only includes the name. This inconsistency could make debugging harder when multiple namespaces contain resources with the same name.

Proposed fix:

```diff
 func (arcm *alertRelabelConfigManager) Delete(ctx context.Context, namespace string, name string) error {
 	err := arcm.clientset.MonitoringV1().AlertRelabelConfigs(namespace).Delete(ctx, name, metav1.DeleteOptions{})
 	if err != nil {
-		return fmt.Errorf("failed to delete AlertRelabelConfig %s: %w", name, err)
+		return fmt.Errorf("failed to delete AlertRelabelConfig %s/%s: %w", namespace, name, err)
 	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/k8s/alert_relabel_config.go` around lines 95-102: the Delete method on alertRelabelConfigManager returns an error message that only includes the resource name; update alertRelabelConfigManager.Delete to include both namespace and name in the error string (matching the Create/Update pattern) by formatting the error as "%s/%s" (namespace/name) when wrapping the returned error from arcm.clientset.MonitoringV1().AlertRelabelConfigs(namespace).Delete; ensure the error message text mirrors the existing style used in Create/Update for consistency.

internal/managementrouter/create_alert_rule.go (1)
21-26: Consider limiting request body size. The handler decodes the request body without size limits. A malicious client could send an extremely large payload, potentially causing memory exhaustion. Consider using `http.MaxBytesReader` to limit the request body size.

Proposed fix:

```diff
 func (hr *httpRouter) CreateAlertRule(w http.ResponseWriter, req *http.Request) {
+	// Limit request body to 1MB to prevent memory exhaustion
+	req.Body = http.MaxBytesReader(w, req.Body, 1<<20)
+
 	var payload CreateAlertRuleRequest
 	if err := json.NewDecoder(req.Body).Decode(&payload); err != nil {
 		writeError(w, http.StatusBadRequest, "invalid request body")
 		return
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/managementrouter/create_alert_rule.go` around lines 21-26: the CreateAlertRule handler decodes req.Body without any size limit, risking memory exhaustion; wrap the request body with http.MaxBytesReader (e.g., a maxBytes constant like 1<<20 for 1MB) before decoding in CreateAlertRule on the httpRouter, then decode into CreateAlertRuleRequest from that wrapped reader and handle a size-exceeded error by calling writeError(w, http.StatusRequestEntityTooLarge, "...") (or an appropriate message) so overly large payloads are rejected gracefully.

internal/managementrouter/router.go (1)
62-64: Consider limiting error details in 500 responses. The current implementation returns the full internal error message to the client on unexpected errors. While this aids debugging, it may leak implementation details (e.g., database connection strings, internal paths) in production.
Proposed fix (return a generic message to the client):

```diff
 log.Printf("An unexpected error occurred: %v", err)
-return http.StatusInternalServerError, fmt.Sprintf("An unexpected error occurred: %s", err.Error())
+return http.StatusInternalServerError, "An unexpected error occurred"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/managementrouter/router.go` around lines 62-64: the handler currently returns the full internal error string in 500 responses; update the code that returns http.StatusInternalServerError (the block using log.Printf("An unexpected error occurred: %v", err) and return http.StatusInternalServerError, fmt.Sprintf("An unexpected error occurred: %s", err.Error())) to keep detailed error logging server-side but send a generic client message instead (e.g., log the full err with log.Printf or a logger, then return http.StatusInternalServerError and a constant message like "An unexpected error occurred" without including err.Error()) so internal details aren't leaked.

pkg/alert_rule/alert_rule.go (1)
69-71: Minor: redundant `TrimSpace` call. `key` was already trimmed on line 61. The second `TrimSpace` on line 69 is unnecessary.

♻️ Suggested simplification:

```diff
-	if !promLabelNameRegexp.MatchString(strings.TrimSpace(key)) {
+	if !promLabelNameRegexp.MatchString(key) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/alert_rule/alert_rule.go` around lines 69-71: the second strings.TrimSpace call is redundant because key was already trimmed earlier; in the block using promLabelNameRegexp.MatchString, replace strings.TrimSpace(key) with key to avoid unnecessary allocation while preserving behavior; look for the check involving promLabelNameRegexp and the variable key in alert_rule.go and remove the extra TrimSpace.

pkg/management/create_platform_alert_rule.go (1)
101-119: Consider documenting the alert name conflict check behavior.
`addRuleToGroup` checks for alert name conflicts (line 107) but not for spec equivalence (unlike `CreateUserDefinedAlertRule`, which checks both). This means two platform rules with different names but identical specs could potentially be added. The duplicate check on lines 29-31 via `RelabeledRules().Get()` should catch exact ID matches, but the relabeled rules cache is eventually consistent. If this asymmetry is intentional, a brief comment explaining the design choice would help future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/management/create_platform_alert_rule.go` around lines 101-119: the addRuleToGroup function only prevents duplicate alert names but not duplicate rule specs (unlike CreateUserDefinedAlertRule which checks spec equivalence), creating an intentional asymmetry due to eventual consistency of RelabeledRules().Get(); update the code by adding a concise in-code comment inside addRuleToGroup (near the name-conflict check and before appending the new group) that explains this design choice: that we only check the Alert name here, that spec equivalence is validated elsewhere (CreateUserDefinedAlertRule and via RelabeledRules().Get()), and that the relabeled cache is eventually consistent so duplicate-spec prevention is handled at a different layer rather than here; reference addRuleToGroup and RelabeledRules().Get() in the comment so future maintainers can locate the related logic.

pkg/management/create_user_defined_alert_rule.go (1)
97-124: Consider normalizing expression comparison for equivalence check. `rulesHaveEquivalentSpec` compares expressions via `a.Expr.String() != b.Expr.String()`, which is sensitive to whitespace differences. However, `GetAlertingRuleId` normalizes expressions (collapses whitespace) before hashing.

This means two rules with equivalent normalized expressions but different whitespace would:
- Have the same computed ID (duplicate detected at lines 32-35)
- But not be flagged as equivalent spec here (lines 100-102)

If this is intentional (exact byte-for-byte match for spec equivalence), it's fine. Otherwise, consider reusing `normalizeExpr`:

```diff
+import alertrule "github.com/openshift/monitoring-plugin/pkg/alert_rule"
+
 func rulesHaveEquivalentSpec(a, b monitoringv1.Rule) bool {
-	if a.Expr.String() != b.Expr.String() {
+	if alertrule.NormalizeExpr(a.Expr.String()) != alertrule.NormalizeExpr(b.Expr.String()) {
 		return false
 	}
```

Note: This would require exporting `normalizeExpr` from the `alertrule` package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/management/create_user_defined_alert_rule.go` around lines 97-124: rulesHaveEquivalentSpec currently compares PromQL expressions with a raw string comparison (a.Expr.String() != b.Expr.String()), which is whitespace-sensitive and can mismatch rules that GetAlertingRuleId treats as identical; change rulesHaveEquivalentSpec to compare normalized expressions by applying the same normalizeExpr used by GetAlertingRuleId (or make normalizeExpr exported from the alertrule package and call alertrule.NormalizeExpr here), then compare the normalized strings; ensure imports are updated and keep existing checks for For and filterBusinessLabels unchanged so equivalence now uses normalized expressions.

pkg/k8s/alerting_rule.go (1)
100-107: Consider idempotent delete behavior for NotFound errors. The `Delete` method returns an error when the resource is not found, whereas `Get` returns `(nil, false, nil)` for the same case. For idempotent delete semantics (common in reconciliation loops), you might want to suppress `NotFound` errors:

```diff
 func (arm *alertingRuleManager) Delete(ctx context.Context, name string) error {
 	err := arm.clientset.MonitoringV1().AlertingRules(ClusterMonitoringNamespace).Delete(ctx, name, metav1.DeleteOptions{})
 	if err != nil {
+		if errors.IsNotFound(err) {
+			return nil
+		}
 		return fmt.Errorf("failed to delete AlertingRule %s/%s: %w", ClusterMonitoringNamespace, name, err)
 	}
 	return nil
 }
```

If callers need to know whether the resource existed, this change would be inappropriate; keep the current behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/k8s/alerting_rule.go` around lines 100-107: the Delete method on alertingRuleManager returns an error when the resource is NotFound; change it to be idempotent by checking the delete error from arm.clientset.MonitoringV1().AlertingRules(ClusterMonitoringNamespace).Delete(...) and returning nil if it is a NotFound error (use apierrors.IsNotFound). Ensure you add the apierrors import (k8s.io/apimachinery/pkg/api/errors) and keep returning wrapped errors for other failure cases so only NotFound is suppressed.

pkg/management/create_platform_alert_rule_test.go (1)
97-123: Consider adding a test for operator-managed AlertingRule rejection. The test suite covers GitOps-managed rejection (lines 98-123), but there is no test case for operator-managed AlertingRule rejection. The implementation in `create_platform_alert_rule.go` (lines 47-51) handles both cases, but only one is tested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/management/create_platform_alert_rule_test.go` around lines 97 - 123, Add a test case parallel to the GitOps-managed one that verifies CreatePlatformAlertRule rejects operator-managed AlertingRule: in pkg/management/create_platform_alert_rule_test.go add an It block under the existing Context("when target AlertingRule exists") that configures mockK8s.AlertingRulesFunc to return an osmv1.AlertingRule whose ObjectMeta contains the operator-management annotation/key that create_platform_alert_rule.go checks (the same annotation/key used in the implementation), call client.CreatePlatformAlertRule(ctx, baseRule), and assert the call returns an error containing the same "managed by" rejection message; mirror the structure of the existing GitOps test and use the same MockAlertingRuleInterface GetFunc pattern so the operator-managed branch in CreatePlatformAlertRule is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Line 3: go.mod currently declares "go 1.25.0" which mismatches the project
builder images referencing Go 1.24; update either the builder images to use Go
1.25 or revert the go directive in go.mod to 1.24 so CI and dev containers use
the same toolchain. Locate the go directive in go.mod ("go 1.25.0") and change
it to "go 1.24" to match the Dockerfiles (or update the Dockerfiles to use
openshift-golang-builder with 1.25 consistently), then run go mod tidy and
verify builds locally to ensure the change resolves the mismatch.
In `@pkg/k8s/relabeled_rules.go`:
- Around line 418-427: The List method on relabeledRulesManager currently
returns shallow copies of cached monitoringv1.Rule objects (rrm.relabeledRules),
exposing shared Labels/Annotations/For maps and *relabel.Config pointers; change
it to return deep copies: while holding the lock, for each rule produce a full
deep copy (use the Rule.DeepCopy() helper if available, otherwise copy primitive
fields and make new maps for Labels/Annotations, new slices for For/Exprs, and
clone each relabel.Config into a new instance) and append the deep copy to the
result slice; apply the same deep-copy approach to the other cache-returning
method(s) referenced around lines 430-446 (e.g., any List*/Get* that returns
cached rules or relabel configs) so callers never receive shared internal
pointers or maps after the lock is released.
- Around line 142-150: After cache sync returns, ensure the manager performs an
initial successful sync before being exposed: call the existing sync(ctx) (or
trigger/wait for the first worker pass) after launching rrm.worker(ctx) and
before returning from the constructor so relabeledRules is populated;
specifically, replace or augment the current rrm.queue.Add("initial-sync")
behavior by invoking rrm.sync(ctx) (or blocking until rrm.worker's first pass
completes) and only return the manager once that initial sync completes
successfully to prevent early POST requests from missing existing-rule
conflicts.
- Around line 269-317: The loop is mutating the informer-cached PrometheusRule
because `rule` is a shallow copy and writes to `rule.Labels` modify the
original; fix by cloning the labels map before any writes: in the loop over
`group.Rules`, if `rule.Labels` is non-nil create a new map and copy all
key/value pairs into it (otherwise allocate a new map), assign that new map back
to `rule.Labels`, then proceed to add synthetic labels (e.g. setting
managementlabels.AlertNameLabel, AlertRuleLabelId,
PrometheusRuleLabelNamespace/Name and managed-by labels) and run relabel
processing (`labels.NewBuilder` / `relabel.ProcessBuilder`) against the cloned
map so you never mutate the informer-cached object.
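The clone-before-write fix described above reduces to a copy-on-write of the labels map. A minimal stand-alone sketch of the idea (the label key below is illustrative, not one of the real `managementlabels` constants):

```go
package main

import "fmt"

// cloneLabels copies a labels map so synthetic labels can be added without
// mutating the informer-cached rule; a nil input yields a fresh empty map.
func cloneLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in)+4) // headroom for synthetic labels
	for k, v := range in {
		out[k] = v
	}
	return out
}

func main() {
	cached := map[string]string{"severity": "warning"} // pretend this is informer-owned
	local := cloneLabels(cached)
	local["alert-rule-id"] = "abc123" // hypothetical synthetic label

	fmt.Println(cached["alert-rule-id"] == "") // true: the cached map is untouched
	fmt.Println(local["severity"])             // warning
}
```

The same pattern applies to the `List` deep-copy fix above: any map, slice, or pointer handed out of the cache must be a fresh copy, because client-go treats informer-cached objects as read-only shared state.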
In `@pkg/management/alert_rule_preconditions.go`:
- Around line 80-98: validatePlatformUpdatePreconditions currently ignores the
operatorManaged boolean from k8s.IsExternallyManagedObject and doesn't check
operator-managed status for arc, allowing operator-managed resources to be
updated; update the function to mirror validatePlatformDeletePreconditions by
capturing both return values from k8s.IsExternallyManagedObject(pr) and
returning the same rejection when operatorManaged is true, and add an
operator-managed check for arc (using the corresponding k8s helper that checks
annotations/labels) to return the same not-allowed error when arc is
operator-managed.
In `@pkg/management/management.go`:
- Around line 13-14: The current IsPlatformAlertRule(prId types.NamespacedName)
only checks Namespace().IsClusterMonitoringNamespace and ignores the specific
PrometheusRule identity, causing false positives; update IsPlatformAlertRule to
fetch the PrometheusRule identified by prId (e.g., via
c.k8sClient.Get/PrometheusRule client), then determine platform ownership from
the rule’s metadata (ownerReferences where Controller==true and kind indicates
platform controller, or a platform-specific annotation such as a
managed-by/owned-by key), and only return true when that explicit provenance
indicates platform ownership; if fetching the resource is impossible, fall back
to a stricter namespace policy (e.g., a whitelist of known platform namespaces)
rather than the broad cluster-monitoring label; update any callers like
CreateUserDefinedAlertRule to use the revised IsPlatformAlertRule
signature/behavior.
In `@pkg/management/testutils/k8s_client_mock.go`:
- Around line 16-23: The accessors on MockClient (PrometheusRulesFunc,
AlertRelabelConfigsFunc, AlertingRulesFunc, RelabeledRulesFunc, NamespaceFunc)
currently return a new mock on each call; change them to lazily initialize and
memoize one backing mock instance per interface on the MockClient struct (e.g.,
add fields like prometheusRules, alertRelabelConfigs, alertingRules,
relabeledRules, namespace) and have each accessor check the backing field,
create and store the mock if nil, then return the stored instance so repeated
calls (e.g., AlertingRules().Create(...) followed by AlertingRules().Get(...))
operate on the same mock store; keep TestConnectionFunc behavior unchanged.
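The memoized-accessor fix can be sketched with a simplified stand-in for the generated mocks (`mockStore` here replaces the real per-interface mock types; only the lazy-init-and-reuse shape matters):

```go
package main

import "fmt"

// mockStore is a simplified stand-in for one per-interface mock.
type mockStore struct{ items map[string]string }

func (m *mockStore) Create(name, val string) { m.items[name] = val }
func (m *mockStore) Get(name string) (string, bool) {
	v, ok := m.items[name]
	return v, ok
}

// MockClient memoizes one backing mock per interface, so repeated accessor
// calls share state, which is the behavior the review asks for.
type MockClient struct {
	alertingRules *mockStore
}

func (c *MockClient) AlertingRules() *mockStore {
	if c.alertingRules == nil { // lazily initialize once, then reuse
		c.alertingRules = &mockStore{items: map[string]string{}}
	}
	return c.alertingRules
}

func main() {
	c := &MockClient{}
	c.AlertingRules().Create("rule-a", "spec")
	v, ok := c.AlertingRules().Get("rule-a") // same backing store as Create
	fmt.Println(v, ok)                       // spec true
}
```

Without the memoization, `AlertingRules().Create(...)` followed by `AlertingRules().Get(...)` would hit two unrelated mock instances and the Get would always miss.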
In `@pkg/server.go`:
- Around line 153-167: The current synchronous initialization of the Kubernetes
management client (managementClient) uses k8s.NewClient and
k8sClient.TestConnection under the long-lived server context and can block
startup; wrap the creation and connection check in a bounded initialization
(e.g., create a child context with a short timeout or deadline and use that for
k8s.NewClient and k8sClient.TestConnection) and fail fast if the timeout
elapses, or alternatively refactor so NewClient returns a client that starts
informers asynchronously (no cache-sync wait) and performs cache sync in the
background while the server continues to start; update the code paths
referencing managementClient to handle startup-time timeout failures or a
not-yet-synced informer client accordingly (references: managementClient,
k8s.NewClient, k8sClient.TestConnection, pkg/k8s/client.go).
---
Nitpick comments:
In `@internal/managementrouter/create_alert_rule.go`:
- Around line 21-26: The CreateAlertRule handler decodes req.Body without any
size limit, risking memory exhaustion; wrap the request body with
http.MaxBytesReader (e.g., maxBytes constant like 1<<20 for 1MB) before decoding
in CreateAlertRule on the httpRouter, then decode into CreateAlertRuleRequest
from that wrapped reader and handle a size-exceeded error by calling
writeError(w, http.StatusRequestEntityTooLarge, "...") (or appropriate message)
so overly large payloads are rejected gracefully.
In `@internal/managementrouter/router.go`:
- Around line 62-64: The handler currently returns the full internal error
string in 500 responses; update the code that returns
http.StatusInternalServerError (the block using log.Printf("An unexpected error
occurred: %v", err) and return http.StatusInternalServerError, fmt.Sprintf("An
unexpected error occurred: %s", err.Error())) to keep detailed error logging
server-side but send a generic client message instead (e.g., log the full err
with log.Printf or a logger, then return http.StatusInternalServerError and a
constant message like "An unexpected error occurred" without including
err.Error()) so internal details aren’t leaked.
In `@pkg/alert_rule/alert_rule.go`:
- Around line 69-71: The second strings.TrimSpace call is redundant because key
was already trimmed earlier; in the block using promLabelNameRegexp.MatchString
replace strings.TrimSpace(key) with key to avoid unnecessary allocation while
preserving behavior—look for the check involving promLabelNameRegexp and the
variable key in alert_rule.go and remove the extra TrimSpace.
In `@pkg/k8s/alert_relabel_config.go`:
- Around line 95-102: The Delete method on alertRelabelConfigManager returns an
error message that only includes the resource name; update
alertRelabelConfigManager.Delete to include both namespace and name in the error
string (matching the Create/Update pattern) by formatting the error as "%s/%s"
(namespace/name) when wrapping the returned error from
arcm.clientset.MonitoringV1().AlertRelabelConfigs(namespace).Delete; ensure the
error message text mirrors the existing style used in Create/Update for
consistency.
In `@pkg/k8s/alerting_rule.go`:
- Around line 100-107: The Delete method on alertingRuleManager returns an error
when the resource is NotFound; change it to be idempotent by checking the delete
error from
arm.clientset.MonitoringV1().AlertingRules(ClusterMonitoringNamespace).Delete(...)
and return nil if it is a NotFound error (use apierrors.IsNotFound). Ensure you
add the apierrors import (k8s.io/apimachinery/pkg/api/errors) and keep returning
wrapped errors for other failure cases so only NotFound is suppressed.
In `@pkg/management/create_platform_alert_rule_test.go`:
- Around line 97-123: Add a test case parallel to the GitOps-managed one that
verifies CreatePlatformAlertRule rejects operator-managed AlertingRule: in
pkg/management/create_platform_alert_rule_test.go add an It block under the
existing Context("when target AlertingRule exists") that configures
mockK8s.AlertingRulesFunc to return an osmv1.AlertingRule whose ObjectMeta
contains the operator-management annotation/key that
create_platform_alert_rule.go checks (the same annotation/key used in the
implementation), call client.CreatePlatformAlertRule(ctx, baseRule), and assert
the call returns an error containing the same "managed by" rejection message;
mirror the structure of the existing GitOps test and use the same
MockAlertingRuleInterface GetFunc pattern so the operator-managed branch in
CreatePlatformAlertRule is exercised.
In `@pkg/management/create_platform_alert_rule.go`:
- Around line 101-119: The addRuleToGroup function only prevents duplicate alert
names (in addRuleToGroup) but not duplicate rule specs (unlike
CreateUserDefinedAlertRule which checks spec equivalence), creating an
intentional asymmetry due to eventual consistency of RelabeledRules().Get();
update the code by adding a concise in-code comment inside addRuleToGroup (near
the name-conflict check and before appending the new group) that explains this
design choice: that we only check Alert name here, that spec equivalence is
validated elsewhere (CreateUserDefinedAlertRule and via RelabeledRules().Get()),
and that the relabeled cache is eventually consistent so duplicate-spec
prevention is handled at a different layer rather than here; reference
addRuleToGroup and RelabeledRules().Get() in the comment so future maintainers
can locate the related logic.
In `@pkg/management/create_user_defined_alert_rule.go`:
- Around line 97-124: rulesHaveEquivalentSpec currently compares PromQL
expressions with a raw string comparison (a.Expr.String() != b.Expr.String()),
which is whitespace-sensitive and can mismatch rules that GetAlertingRuleId
treats as identical; change rulesHaveEquivalentSpec to compare normalized
expressions by applying the same normalizeExpr used by GetAlertingRuleId (or
make normalizeExpr exported from the alertrule package and call
alertrule.NormalizeExpr here), then compare the normalized strings; ensure
imports are updated and keep existing checks for For and filterBusinessLabels
unchanged so equivalence now uses normalized expressions.
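The whitespace normalization this prompt refers to can be sketched as follows. This is a hypothetical version of `normalizeExpr` (trim, then collapse whitespace runs); note that collapsing whitespace inside PromQL string literals would be unsafe, so a production implementation should parse the expression rather than regex it:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var whitespaceRe = regexp.MustCompile(`\s+`)

// normalizeExpr is a hypothetical sketch of the normalization the review
// describes: trim the expression and collapse runs of whitespace so that
// formatting differences do not affect comparison.
func normalizeExpr(expr string) string {
	return whitespaceRe.ReplaceAllString(strings.TrimSpace(expr), " ")
}

// exprsEquivalent compares two expressions after normalization, mirroring the
// suggested change to rulesHaveEquivalentSpec.
func exprsEquivalent(a, b string) bool {
	return normalizeExpr(a) == normalizeExpr(b)
}

func main() {
	a := "sum(rate(http_requests_total[5m]))  >  10"
	b := "sum(rate(http_requests_total[5m])) > 10"
	fmt.Println(a == b)                // false: raw comparison is whitespace-sensitive
	fmt.Println(exprsEquivalent(a, b)) // true: normalized forms match
}
```

With both the ID computation and the equivalence check funneled through one normalization function, the "same ID but different spec" inconsistency the review describes cannot arise.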
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 02cd0c02-2be5-4abb-bf99-14c9b7e5a79e
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (34)
Dockerfile, Dockerfile.dev, Dockerfile.dev-mcp, Dockerfile.devspace, Dockerfile.konflux, Dockerfile.mcp, Makefile, go.mod, internal/managementrouter/create_alert_rule.go, internal/managementrouter/create_alert_rule_test.go, internal/managementrouter/managementrouter_suite_test.go, internal/managementrouter/query_filters.go, internal/managementrouter/router.go, pkg/alert_rule/alert_rule.go, pkg/k8s/alert_relabel_config.go, pkg/k8s/alerting_rule.go, pkg/k8s/client.go, pkg/k8s/external_management.go, pkg/k8s/relabeled_rules.go, pkg/k8s/types.go, pkg/management/alert_rule_id_match.go, pkg/management/alert_rule_preconditions.go, pkg/management/client_factory.go, pkg/management/create_platform_alert_rule.go, pkg/management/create_platform_alert_rule_test.go, pkg/management/create_user_defined_alert_rule.go, pkg/management/create_user_defined_alert_rule_test.go, pkg/management/errors.go, pkg/management/label_utils.go, pkg/management/management.go, pkg/management/testutils/k8s_client_mock.go, pkg/management/types.go, pkg/managementlabels/management_labels.go, pkg/server.go
```diff
 module github.com/openshift/monitoring-plugin

-go 1.24.0
+go 1.25.0
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "== go.mod directive =="
rg -n '^go\s+[0-9]+\.[0-9]+' go.mod
echo
echo "== Go versions referenced by Dockerfiles =="
fd '^Dockerfile(\..+)?$' -t f | while read -r f; do
  echo "--- ${f}"
  rg -n 'openshift-golang-builder|golang_[0-9]+\.[0-9]+|go[0-9]+\.[0-9]+' "$f" || true
done
```

Repository: openshift/monitoring-plugin
Length of output: 914
Align go.mod toolchain with builder images before merge.
go.mod line 3 specifies Go 1.25.0, but Dockerfiles (Dockerfile.dev, Dockerfile.dev-mcp, Dockerfile.devspace, Dockerfile.konflux, Dockerfile.mcp) all reference openshift-golang-builder:rhel_9_golang_1.24 or rhel_9_1.24. This version mismatch will cause build failures in CI and dev containers.
Either bump all builder images to Go 1.25 or revert the go directive in go.mod to 1.24.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go.mod` at line 3, go.mod currently declares "go 1.25.0" which mismatches the
project builder images referencing Go 1.24; update either the builder images to
use Go 1.25 or revert the go directive in go.mod to 1.24 so CI and dev
containers use the same toolchain. Locate the go directive in go.mod ("go
1.25.0") and change it to "go 1.24" to match the Dockerfiles (or update the
Dockerfiles to use openshift-golang-builder with 1.25 consistently), then run go
mod tidy and verify builds locally to ensure the change resolves the mismatch.
```go
if !cache.WaitForNamedCacheSync("RelabeledRulesConfig informer", ctx.Done(),
	rrm.prometheusRulesInformer.HasSynced,
	rrm.secretInformer.HasSynced,
) {
	return nil, fmt.Errorf("failed to sync RelabeledRulesConfig informer")
}

go rrm.worker(ctx)
rrm.queue.Add("initial-sync")
```
Warm the cache before exposing this manager.
After `WaitForNamedCacheSync`, `relabeledRules` is still empty until the worker eventually consumes `"initial-sync"`. Early `POST /api/v1/alerting/rules` requests can therefore miss existing-rule conflicts right after startup. Run one successful `sync(ctx)` (or wait for the first worker pass) before returning.
🔧 Example fix

```diff
-go rrm.worker(ctx)
-rrm.queue.Add("initial-sync")
+if err := rrm.sync(ctx); err != nil {
+	return nil, fmt.Errorf("initial relabeled-rules sync: %w", err)
+}
+
+go rrm.worker(ctx)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if !cache.WaitForNamedCacheSync("RelabeledRulesConfig informer", ctx.Done(),
	rrm.prometheusRulesInformer.HasSynced,
	rrm.secretInformer.HasSynced,
) {
	return nil, fmt.Errorf("failed to sync RelabeledRulesConfig informer")
}

if err := rrm.sync(ctx); err != nil {
	return nil, fmt.Errorf("initial relabeled-rules sync: %w", err)
}

go rrm.worker(ctx)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/k8s/relabeled_rules.go` around lines 142 - 150, After cache sync returns,
ensure the manager performs an initial successful sync before being exposed:
call the existing sync(ctx) (or trigger/wait for the first worker pass) after
launching rrm.worker(ctx) and before returning from the constructor so
relabeledRules is populated; specifically, replace or augment the current
rrm.queue.Add("initial-sync") behavior by invoking rrm.sync(ctx) (or blocking
until rrm.worker's first pass completes) and only return the manager once that
initial sync completes successfully to prevent early POST requests from missing
existing-rule conflicts.
```go
for _, rule := range group.Rules {
	// Only process alerting rules (skip recording rules)
	if rule.Alert == "" {
		continue
	}

	// Compute a deterministic id from the rule spec.
	// Do not trust any user-provided value in openshift_io_alert_rule_id since
	// PrometheusRule content (including labels) can be tampered with.
	alertRuleId := alertrule.GetAlertingRuleId(&rule)
	if _, exists := seenIDs[alertRuleId]; exists {
		// A second rule that computes to the same id is ambiguous/unsupported (a "true clone").
		// Don't silently overwrite the first rule in the cache.
		log.Warnf("Duplicate alert rule id %q computed for %s/%s (alert=%q); skipping duplicate", alertRuleId, promRule.Namespace, promRule.Name, rule.Alert)
		continue
	}
	seenIDs[alertRuleId] = struct{}{}

	if rule.Labels == nil {
		rule.Labels = make(map[string]string)
	}

	rule.Labels[managementlabels.AlertNameLabel] = rule.Alert

	if rrm.namespaceManager.IsClusterMonitoringNamespace(promRule.Namespace) {
		lb := labels.NewBuilder(labels.FromMap(rule.Labels))
		keep := relabel.ProcessBuilder(lb, relabelConfigs...)
		if !keep {
			log.Infof("Skipping dropped alert %s from %s/%s", rule.Alert, promRule.Namespace, promRule.Name)
			continue
		}

		rule.Labels = lb.Labels().Map()
	}

	rule.Labels[AlertRuleLabelId] = alertRuleId
	rule.Labels[PrometheusRuleLabelNamespace] = promRule.Namespace
	rule.Labels[PrometheusRuleLabelName] = promRule.Name

	if arName := alertingRuleOwner(promRule); arName != "" {
		rule.Labels[managementlabels.AlertingRuleLabelName] = arName
	}

	ruleManagedBy, relabelConfigManagedBy := rrm.determineManagedBy(ctx, promRule, alertRuleId)
	if ruleManagedBy != "" {
		rule.Labels[managementlabels.RuleManagedByLabel] = ruleManagedBy
	}
	if relabelConfigManagedBy != "" {
		rule.Labels[managementlabels.RelabelConfigManagedByLabel] = relabelConfigManagedBy
```
Clone the rule labels before adding synthetic labels.
`rule` is only a shallow copy of the `group.Rules` entry. When the original rule already has a `Labels` map, the writes on lines 291-317 mutate the informer-cached `PrometheusRule` object, which client-go treats as read-only. That can corrupt the shared cache and race with other readers.
🔧 Example fix

```diff
-	for _, group := range promRule.Spec.Groups {
-		for _, rule := range group.Rules {
+	for _, group := range promRule.Spec.Groups {
+		for _, cachedRule := range group.Rules {
+			rule := cachedRule
+			if cachedRule.Labels != nil {
+				rule.Labels = make(map[string]string, len(cachedRule.Labels))
+				for k, v := range cachedRule.Labels {
+					rule.Labels[k] = v
+				}
+			}
```
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/k8s/relabeled_rules.go` around lines 269-317: the loop is mutating the
informer-cached PrometheusRule because `rule` is a shallow copy and writes to
`rule.Labels` modify the original; fix by cloning the labels map before any
writes: in the loop over `group.Rules`, if `rule.Labels` is non-nil create a new
map and copy all key/value pairs into it (otherwise allocate a new map), assign
that new map back to `rule.Labels`, then proceed to add synthetic labels (e.g.
setting managementlabels.AlertNameLabel, AlertRuleLabelId,
PrometheusRuleLabelNamespace/Name and managed-by labels) and run relabel
processing (`labels.NewBuilder` / `relabel.ProcessBuilder`) against the cloned
map so you never mutate the informer-cached object.
```go
func (rrm *relabeledRulesManager) List(ctx context.Context) []monitoringv1.Rule {
	rrm.mu.RLock()
	defer rrm.mu.RUnlock()

	var result []monitoringv1.Rule
	for _, rule := range rrm.relabeledRules {
		result = append(result, rule)
	}

	return result
}
```
Do not return shallow copies of cached rules and configs.
These methods only copy the outer slice/struct. Callers still receive shared `Labels`/`Annotations`/`For` data and the same `*relabel.Config` pointers, so a caller can mutate `rrm` state after the lock is released. Return deep copies before exposing cached data.
Also applies to: 430-446
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/k8s/relabeled_rules.go` around lines 418-427: the List method on
relabeledRulesManager currently returns shallow copies of cached
monitoringv1.Rule objects (rrm.relabeledRules), exposing shared
Labels/Annotations/For maps and *relabel.Config pointers; change it to return
deep copies: while holding the lock, for each rule produce a full deep copy (use
the Rule.DeepCopy() helper if available, otherwise copy primitive fields and
make new maps for Labels/Annotations, new slices for For/Exprs, and clone each
relabel.Config into a new instance) and append the deep copy to the result
slice; apply the same deep-copy approach to the other cache-returning method(s)
referenced around lines 430-446 (e.g., any List*/Get* that returns cached rules
or relabel configs) so callers never receive shared internal pointers or maps
after the lock is released.
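The deep-copy-on-read pattern the prompt asks for can be illustrated with a simplified stand-in type; `Rule` here is a placeholder for `monitoringv1.Rule`, which in the real code has a generated `DeepCopy()` helper that should be preferred over a hand-rolled copy:

```go
package main

import "fmt"

// Rule is a minimal stand-in for monitoringv1.Rule.
type Rule struct {
	Alert  string
	Labels map[string]string
}

// deepCopy clones the rule, including its Labels map.
func (r Rule) deepCopy() Rule {
	out := Rule{Alert: r.Alert}
	if r.Labels != nil {
		out.Labels = make(map[string]string, len(r.Labels))
		for k, v := range r.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

type manager struct {
	cache map[string]Rule
}

// List returns deep copies, so callers cannot mutate the cache after the
// manager's lock would be released.
func (m *manager) List() []Rule {
	result := make([]Rule, 0, len(m.cache))
	for _, r := range m.cache {
		result = append(result, r.deepCopy())
	}
	return result
}

func main() {
	m := &manager{cache: map[string]Rule{
		"id-1": {Alert: "HighLatency", Labels: map[string]string{"severity": "warning"}},
	}}
	got := m.List()
	got[0].Labels["severity"] = "critical" // mutating the returned copy...
	fmt.Println(m.cache["id-1"].Labels["severity"])
	// → warning (the cache is untouched)
}
```

The same copy step applies to any other accessor that returns cached rules or relabel configs.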
```go
// validatePlatformUpdatePreconditions enforces constraints before ARC-based update.
// pr may be nil if not fetched yet; arc may be nil if absent.
func validatePlatformUpdatePreconditions(relabeled monitoringv1.Rule, pr *monitoringv1.PrometheusRule, arc *osmv1.AlertRelabelConfig) error {
	// Rule-level GitOps block
	if isRuleManagedByGitOpsLabel(relabeled) {
		return notAllowedGitOpsEdit()
	}
	// PR metadata GitOps block
	if pr != nil {
		if gitOpsManaged, _ := k8s.IsExternallyManagedObject(pr); gitOpsManaged {
			return notAllowedGitOpsEdit()
		}
	}
	// ARC metadata GitOps block
	if arc != nil && k8s.IsManagedByGitOps(arc.Annotations, arc.Labels) {
		return notAllowedGitOpsEdit()
	}
	return nil
}
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Verify if the operator management check is handled elsewhere for platform updates,
# or if this is an intentional omission.

# Check how validatePlatformUpdatePreconditions is called
rg -n "validatePlatformUpdatePreconditions" --type go -C5

# Check if there's other operator-management blocking for platform updates
rg -n "operatorManaged.*platform" --type go -i
```

Repository: openshift/monitoring-plugin
Length of output: 1116

🏁 Script executed:

```shell
#!/bin/bash
# Read the complete validatePlatformUpdatePreconditions function
cat -n pkg/management/alert_rule_preconditions.go | sed -n '80,98p'
```

Length of output: 911

🏁 Script executed:

```shell
#!/bin/bash
# Find all callers of validatePlatformUpdatePreconditions
rg -n "validatePlatformUpdatePreconditions" --type go -B2 -A10 | head -100
```

Length of output: 1321

🏁 Script executed:

```shell
#!/bin/bash
# Compare with validatePlatformDeletePreconditions - show both functions side by side
echo "=== DELETE PRECONDITIONS ===" && \
cat -n pkg/management/alert_rule_preconditions.go | sed -n '68,78p' && \
echo -e "\n=== UPDATE PRECONDITIONS ===" && \
cat -n pkg/management/alert_rule_preconditions.go | sed -n '80,98p'
```

Length of output: 1415

🏁 Script executed:

```shell
#!/bin/bash
# Check if IsExternallyManagedObject is used consistently throughout the file
rg -n "IsExternallyManagedObject" --type go -B2 -A2
```

Length of output: 3379
`validatePlatformUpdatePreconditions` only checks GitOps management, not operator management.
Line 89 discards the `operatorManaged` return value from `IsExternallyManagedObject()`, while `validatePlatformDeletePreconditions` (lines 68-78) correctly checks both `gitOpsManaged` and `operatorManaged`. This asymmetry allows operator-managed PrometheusRules to be updated via ARC when they should be blocked, which is inconsistent with the delete preconditions that properly reject operator-managed resources.
Either add an operator-management check on the PR and ARC to match the delete logic, or add a comment explaining why operator management is intentionally excluded from platform updates.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/management/alert_rule_preconditions.go` around lines 80-98:
validatePlatformUpdatePreconditions currently ignores the operatorManaged
boolean from k8s.IsExternallyManagedObject and doesn't check operator-managed
status for arc, allowing operator-managed resources to be updated; update the
function to mirror validatePlatformDeletePreconditions by capturing both return
values from k8s.IsExternallyManagedObject(pr) and returning the same rejection
when operatorManaged is true, and add an operator-managed check for arc (using
the corresponding k8s helper that checks annotations/labels) to return the same
not-allowed error when arc is operator-managed.
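The symmetric check the prompt describes, consuming both booleans instead of discarding one, can be sketched as follows. `isExternallyManaged` is a hypothetical stand-in for `k8s.IsExternallyManagedObject`, and the annotation keys below are placeholders, not the plugin's actual markers:

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errGitOpsEdit   = errors.New("rule is managed by GitOps; edit it in the GitOps repository")
	errOperatorEdit = errors.New("rule is managed by an operator; edit it via the operator")
)

// isExternallyManaged stands in for k8s.IsExternallyManagedObject, which
// reports (gitOpsManaged, operatorManaged) from object metadata.
// The annotation keys here are illustrative placeholders.
func isExternallyManaged(annotations map[string]string) (bool, bool) {
	return annotations["example.io/gitops-instance"] != "",
		annotations["example.io/operator-managed"] == "true"
}

// validateUpdate mirrors the delete preconditions: both GitOps- and
// operator-managed objects are rejected.
func validateUpdate(annotations map[string]string) error {
	gitOpsManaged, operatorManaged := isExternallyManaged(annotations)
	if gitOpsManaged {
		return errGitOpsEdit
	}
	if operatorManaged {
		return errOperatorEdit
	}
	return nil
}

func main() {
	// An operator-managed object is now rejected on update, matching delete.
	fmt.Println(validateUpdate(map[string]string{"example.io/operator-managed": "true"}))
}
```

The key change versus the quoted function is simply capturing the second return value and acting on it, rather than discarding it with `_`.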
```go
func (c *client) IsPlatformAlertRule(prId types.NamespacedName) bool {
	return c.k8sClient.Namespace().IsClusterMonitoringNamespace(prId.Namespace)
}
```
Namespace label is too broad for platform-rule detection.
Line 14 ignores `prId.Name` and only checks whether the namespace is labeled `openshift.io/cluster-monitoring=true`. `pkg/k8s/namespace.go` shows that is just a namespace-level monitoring label, not proof that the specific PrometheusRule is platform-owned. Downstream, `CreateUserDefinedAlertRule` uses this to reject writes, so opted-in namespaces will be falsely treated as platform-managed.
Please base this on the actual PrometheusRule provenance/ownership, or at least a narrower platform-namespace policy, instead of the cluster-monitoring label alone.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/management/management.go` around lines 13-14: the current
IsPlatformAlertRule(prId types.NamespacedName) only checks
Namespace().IsClusterMonitoringNamespace and ignores the specific PrometheusRule
identity, causing false positives; update IsPlatformAlertRule to fetch the
PrometheusRule identified by prId (e.g., via c.k8sClient.Get/PrometheusRule
client), then determine platform ownership from the rule’s metadata
(ownerReferences where Controller==true and kind indicates platform controller,
or a platform-specific annotation such as a managed-by/owned-by key), and only
return true when that explicit provenance indicates platform ownership; if
fetching the resource is impossible, fall back to a stricter namespace policy
(e.g., a whitelist of known platform namespaces) rather than the broad
cluster-monitoring label; update any callers like CreateUserDefinedAlertRule to
use the revised IsPlatformAlertRule signature/behavior.
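An ownership check based on controlling owner references, as the prompt suggests, could look roughly like this. `OwnerRef` is a simplified stand-in for `metav1.OwnerReference`, and the allow-list of controller kinds is an assumption for illustration, not the plugin's real policy:

```go
package main

import "fmt"

// OwnerRef is a simplified stand-in for metav1.OwnerReference.
type OwnerRef struct {
	Kind       string
	Controller bool
}

// platformControllerKinds is an assumed allow-list; the real set would come
// from the plugin's knowledge of which operators own platform rules.
var platformControllerKinds = map[string]bool{
	"ClusterOperator": true,
	"AlertingRule":    true,
}

// isPlatformOwned reports whether a PrometheusRule's metadata shows a
// controlling owner that is a known platform controller, rather than
// inferring ownership from a namespace label alone.
func isPlatformOwned(owners []OwnerRef) bool {
	for _, o := range owners {
		if o.Controller && platformControllerKinds[o.Kind] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isPlatformOwned([]OwnerRef{{Kind: "AlertingRule", Controller: true}})) // true
	fmt.Println(isPlatformOwned([]OwnerRef{{Kind: "Deployment", Controller: true}}))   // false
}
```

A user-created rule in an opted-in namespace has no platform controller owner, so it would no longer be misclassified by the namespace label alone.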
```go
type MockClient struct {
	TestConnectionFunc      func(ctx context.Context) error
	PrometheusRulesFunc     func() k8s.PrometheusRuleInterface
	AlertRelabelConfigsFunc func() k8s.AlertRelabelConfigInterface
	AlertingRulesFunc       func() k8s.AlertingRuleInterface
	RelabeledRulesFunc      func() k8s.RelabeledRulesInterface
	NamespaceFunc           func() k8s.NamespaceInterface
}
```
Return stable mock interfaces from `MockClient`.
With the current defaults, each accessor allocates a new empty mock. A test that writes through `client.AlertingRules().Create(...)` and then reads through `client.AlertingRules().Get(...)` will hit two different stores, unlike the real client. Keep one lazily initialized mock instance per interface and return the same object every time.
🔧 Example fix

```diff
 type MockClient struct {
 	TestConnectionFunc      func(ctx context.Context) error
 	PrometheusRulesFunc     func() k8s.PrometheusRuleInterface
 	AlertRelabelConfigsFunc func() k8s.AlertRelabelConfigInterface
 	AlertingRulesFunc       func() k8s.AlertingRuleInterface
 	RelabeledRulesFunc      func() k8s.RelabeledRulesInterface
 	NamespaceFunc           func() k8s.NamespaceInterface
+
+	prometheusRules     k8s.PrometheusRuleInterface
+	alertRelabelConfigs k8s.AlertRelabelConfigInterface
+	alertingRules       k8s.AlertingRuleInterface
+	relabeledRules      k8s.RelabeledRulesInterface
+	namespace           k8s.NamespaceInterface
 }

 func (m *MockClient) PrometheusRules() k8s.PrometheusRuleInterface {
 	if m.PrometheusRulesFunc != nil {
 		return m.PrometheusRulesFunc()
 	}
-	return &MockPrometheusRuleInterface{}
+	if m.prometheusRules == nil {
+		m.prometheusRules = &MockPrometheusRuleInterface{}
+	}
+	return m.prometheusRules
 }
```

Apply the same lazy-init pattern to the other accessors.
Also applies to: 34-70
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/management/testutils/k8s_client_mock.go` around lines 16-23: the
accessors on MockClient (PrometheusRulesFunc, AlertRelabelConfigsFunc,
AlertingRulesFunc, RelabeledRulesFunc, NamespaceFunc) currently return a new
mock on each call; change them to lazily initialize and memoize one backing mock
instance per interface on the MockClient struct (e.g., add fields like
prometheusRules, alertRelabelConfigs, alertingRules, relabeledRules, namespace)
and have each accessor check the backing field, create and store the mock if
nil, then return the stored instance so repeated calls (e.g.,
AlertingRules().Create(...) followed by AlertingRules().Get(...)) operate on the
same mock store; keep TestConnectionFunc behavior unchanged.
```go
	// Initialize management client if management API feature is enabled
	var managementClient management.Client
	if alertManagementAPIMode {
		k8sClient, err := k8s.NewClient(ctx, k8sconfig)
		if err != nil {
			return nil, fmt.Errorf("failed to create k8s client for alert management API: %w", err)
		}

		if err := k8sClient.TestConnection(ctx); err != nil {
			return nil, fmt.Errorf("failed to connect to kubernetes cluster for alert management API: %w", err)
		}

		managementClient = management.New(ctx, k8sClient)
		log.Info("alert management API enabled")
	}
```
Management client init can hang the whole server startup.
On lines 156-165, `k8s.NewClient` runs during server construction under the server-lifetime context. `pkg/k8s/client.go` initializes several informer-backed managers, and those managers wait for cache sync before returning. A slow or unreachable API server, or a missing CRD, can therefore wedge plugin startup indefinitely once this feature flag is enabled.
Please add a bounded initialization path here, or split informer lifetime from init-time sync, so startup fails fast instead of hanging the rollout.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/server.go` around lines 153-167: the current synchronous initialization
of the Kubernetes management client (managementClient) uses k8s.NewClient and
k8sClient.TestConnection under the long-lived server context and can block
startup; wrap the creation and connection check in a bounded initialization
(e.g., create a child context with a short timeout or deadline and use that for
k8s.NewClient and k8sClient.TestConnection) and fail fast if the timeout
elapses, or alternatively refactor so NewClient returns a client that starts
informers asynchronously (no cache-sync wait) and performs cache sync in the
background while the server continues to start; update the code paths
referencing managementClient to handle startup-time timeout failures or a
not-yet-synced informer client accordingly (references: managementClient,
k8s.NewClient, k8sClient.TestConnection, pkg/k8s/client.go).
Add AlertingRule, AlertRelabelConfig, and
RelabeledRules CRD interfaces with the
management client, router, server wiring,
and POST /api/v1/alerting/rules endpoint.
Signed-off-by: Shirly Radco <sradco@redhat.com>
Signed-off-by: João Vilaça <jvilaca@redhat.com>
Signed-off-by: Aviv Litman <alitman@redhat.com>
Co-authored-by: AI Assistant <noreply@cursor.com>